cutoff date
LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models
Pęzik, Piotr, Kaczyński, Konrad, Szymańska, Maria, Żarnecki, Filip, Deckert, Zuzanna, Kwiatkowski, Jakub, Janowski, Wojciech
Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM's training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.
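The core idea of probing recent-event knowledge to locate a probable training boundary can be sketched in a few lines. This is a toy illustration, not the benchmark's actual implementation: `probe_results` is hypothetical data standing in for per-event correctness judgments obtained by querying an LLM.

```python
from datetime import date

# Hypothetical probe results: event date -> whether the model answered
# a question about that event correctly. In the real benchmark these
# judgments would come from querying the LLM about dated news events.
probe_results = {
    date(2023, 1, 15): True,
    date(2023, 4, 2): True,
    date(2023, 7, 20): True,
    date(2023, 10, 5): False,
    date(2024, 1, 12): False,
    date(2024, 3, 30): False,
}

def estimate_cutoff(results):
    """Return the earliest event date from which the model consistently
    fails, i.e. the earliest probable training-data boundary."""
    dates = sorted(results)
    for i, d in enumerate(dates):
        if all(not results[t] for t in dates[i:]):
            return d
    return None

print(estimate_cutoff(probe_results))  # date(2023, 10, 5)
```

Real probes are noisier than this all-or-nothing tail check, which is why the paper frames the result as an *earliest probable* boundary rather than an exact date.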
Memorization: A Close Look at Books
Ma, Iris, Domingo, Ian, Krone-Martins, Alberto, Baldi, Pierre, Lopes, Cristina V.
To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
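The prefix-prompting setup described above can be illustrated with a toy stand-in for the model. This sketch only shows the measurement shape (continue from a prefix, compare the continuation to the reference); `toy_generate` is a hypothetical placeholder for actual autoregressive decoding from Llama 3 70B.

```python
from difflib import SequenceMatcher

# Toy stand-in for an LLM that has memorized a passage verbatim.
memorized = ("Alice was beginning to get very tired of sitting by her "
             "sister on the bank, and of having nothing to do.")

def toy_generate(prefix, n_chars):
    """Pretend autoregressive continuation: echo the memorized text."""
    if memorized.startswith(prefix):
        return memorized[len(prefix):len(prefix) + n_chars]
    return ""

def extraction_similarity(reference, prefix_len):
    """Prompt with the first prefix_len characters and score how closely
    the generated continuation matches the true continuation."""
    prefix = reference[:prefix_len]
    continuation = toy_generate(prefix, len(reference) - prefix_len)
    return SequenceMatcher(None, continuation, reference[prefix_len:]).ratio()

print(extraction_similarity(memorized, 30))  # 1.0 for fully memorized text
```

With a real model the similarity score falls below 1.0, and the paper's finding is that it tracks book popularity, a proxy for duplication in the training data.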
ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models
Liu, Yachuan, Wei, Xiaochun, Shi, Lin, Li, Xinnuo, Zhang, Bohan, Dhillon, Paramveer, Mei, Qiaozhu
Large language models (LLMs) face significant challenges in ex-ante reasoning, where analysis, inference, or predictions must be made without access to information from future events. Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff. This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints. The benchmark includes a variety of tasks: stock prediction, Wikipedia event prediction, scientific publication prediction, and Question Answering (QA), designed to assess factual knowledge under temporal cutoff constraints. We use leakage rate to quantify models' reliance on future information beyond cutoff timestamps. Experimental results reveal that LLMs struggle to consistently adhere to temporal cutoffs across common prompting strategies and tasks, demonstrating persistent challenges in ex-ante reasoning. This benchmark provides a potential evaluation framework to advance the development of LLMs' temporal reasoning ability for time-sensitive applications.
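The leakage-rate metric mentioned above reduces to a simple fraction. This is a minimal sketch under an assumed judging setup: `response_dates` is hypothetical data standing in for, per model response, the latest event date the response demonstrably relies on (`None` meaning no dated information was used).

```python
from datetime import date

cutoff = date(2022, 1, 1)

# Hypothetical judgments for four model responses.
response_dates = [date(2021, 6, 1), date(2022, 3, 15), None, date(2023, 2, 2)]

def leakage_rate(dates, cutoff):
    """Fraction of responses that rely on information from after the cutoff."""
    leaked = sum(1 for d in dates if d is not None and d > cutoff)
    return leaked / len(dates)

print(leakage_rate(response_dates, cutoff))  # 0.5
```

The hard part in practice is producing the per-response judgments, not the arithmetic; the benchmark's tasks are designed so that post-cutoff reliance is detectable.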
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Dai, Hui, Teehan, Ryan, Ren, Mengye
Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of static questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.
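The QA-generation step can be pictured as templating a forecasting question from a dated headline. This is a hypothetical sketch of the data shape only; the actual Daily Oracle pipeline generates questions with an LLM rather than a fixed template, and `make_qa` and its fields are illustrative names.

```python
def make_qa(headline_date, subject, outcome):
    """Turn a dated news item into a yes/no forecasting QA pair."""
    question = (f"As of {headline_date}, will {subject}? "
                "Answer yes or no.")
    return {"date": headline_date, "question": question, "answer": outcome}

qa = make_qa("2024-05-01", "the central bank cut interest rates this month", "no")
print(qa["question"])
```

Because new pairs are minted from each day's news, the benchmark never goes stale, which is what lets it measure performance degradation as pretraining data ages.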
Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Cheng, Jeffrey, Marone, Marc, Weller, Orion, Lawrie, Dawn, Khashabi, Daniel, Van Durme, Benjamin
Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the dates at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align to their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the LLM designer's reported cutoff and applies separately to sub-resources and topics. We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM by probing across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis on open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed, and that care must be taken both by LLM dataset curators and by practitioners who seek to use information from these models.
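Probing across dated versions of a resource can be sketched as follows. This is a toy illustration of the idea, not the paper's method: `perplexity_by_month` is hypothetical data standing in for a model's measured perplexity on dated dumps of one resource, and the dump the model aligns with best suggests its effective cutoff for that resource.

```python
# Hypothetical per-month perplexities of one model on dated dumps
# (e.g. monthly Wikipedia snapshots) of a single resource.
perplexity_by_month = {
    "2022-09": 12.4,
    "2022-12": 11.1,
    "2023-03": 10.2,  # best alignment -> probable effective cutoff
    "2023-06": 13.8,
    "2023-09": 15.0,
}

def effective_cutoff(ppl_by_version):
    """Version of the data the model fits best, per the probing idea."""
    return min(ppl_by_version, key=ppl_by_version.get)

print(effective_cutoff(perplexity_by_month))  # "2023-03"
```

Run per resource, this is what surfaces the paper's finding: different sub-resources of one model can peak at different dates, and those dates need not match the reported cutoff.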
Evaluating Large Language Models for Generalization and Robustness via Data Compression
Li, Yucheng, Guo, Yunhao, Guerin, Frank, Lin, Chenghua
Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address this, we propose a lossless data compression based evaluation approach that tests how models' predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to models' training data cutoff. We measure: 1) the compression performance on the testing period as a measure of generalization on unseen data; and 2) the performance gap between the training and testing period as a measure of robustness. Our experiments test 14 representative large language models with various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that the compression performance of many models degrades significantly after their cutoff date, but models such as Mistral and Llama-2 demonstrate a good balance between performance and robustness. Results also suggest that models struggle to generalize on news and code data, but work especially well on arXiv papers. We also find that the context size and tokenization implementation have a large impact on the overall compression performance.
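The connection between prediction and compression can be shown with toy numbers. Under arithmetic coding, a symbol assigned probability p costs -log2(p) bits, so a model's average code length directly measures its predictive ability; the probabilities below are hypothetical, not taken from any model in the paper.

```python
import math

# Toy next-token probabilities a model assigns to each symbol of a text;
# under arithmetic coding, code length is -log2(p) bits per symbol.
probs_train = [0.5, 0.25, 0.5, 0.4]   # text from before the cutoff
probs_test  = [0.2, 0.1, 0.25, 0.15]  # text from after the cutoff

def bits_per_symbol(probs):
    """Average ideal code length in bits for one symbol stream."""
    return sum(-math.log2(p) for p in probs) / len(probs)

gap = bits_per_symbol(probs_test) - bits_per_symbol(probs_train)
print(round(gap, 2))  # positive gap = worse compression after the cutoff
```

The paper's robustness measure is exactly this kind of gap, computed at scale: a model that truly generalizes past its cutoff shows a small gap, while one that merely fits its training period shows a large one.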
Can ChatGPT discuss current events? Chatbot has clear knowledge cutoff date
During an appearance on "The Ingraham Angle," Jimmy Failla shares his thoughts on the latest interesting development in the world of artificial intelligence. ChatGPT has been a game changer for artificial intelligence, catapulting earlier this year to the fastest-growing web platform ever as millions of people across the world rushed to communicate with a system that can mimic human conversation. The system, however, is unable to respond to current events questions due to having a knowledge cutoff date of September 2021. When Fox News Digital, for example, attempted to ask ChatGPT questions about current events, such as if the Titan submersible implosion could have been prevented or what charges Hunter Biden was hit with this month, the chatbot responded that it does not have knowledge of current events after September 2021. "As an AI language model, I have a knowledge cutoff date because my training data only goes up until September 2021," ChatGPT responded when asked why it does not possess language beyond September 2021.
Sound Explanation for Trustworthy Machine Learning
Jia, Kai, Saowakon, Pasapol, Appelbaum, Limor, Rinard, Martin
We take a formal approach to the explainability problem of machine learning systems. We argue against the practice of interpreting black-box models via attributing scores to input components due to inherently conflicting goals of attribution-based interpretation. We prove that no attribution algorithm satisfies specificity, additivity, completeness, and baseline invariance. We then formalize the concept, sound explanation, that has been informally adopted in prior work. A sound explanation entails providing sufficient information to causally explain the predictions made by a system. Finally, we present the application of feature selection as a sound explanation for cancer prediction models to cultivate trust among clinicians.